About the project

My thoughts about the course

I am excited about this course and eager to learn more about the statistics behind machine learning and the different algorithms. I heard about this course when I took a couple of summer courses taught by Kimmo.

This is my GitHub repository: link

Week 2

I have taken some courses where regression has been a topic before, so not much of this was new. However, it is always good to repeat.

Week 3

This week was trickier than last week. I had very little time to work on the tasks and had to skip some of the theory. I will catch up on it at the end of this week.


Regression Analysis

Analysis task 1

imported_file <- read.csv(file = "learning2014.csv")
head(imported_file)
##   gender age attitude     deep  stra     surf points
## 1      F  53       37 3.583333 3.375 2.583333     25
## 2      M  55       31 2.916667 2.750 3.166667     12
## 3      F  49       25 3.500000 3.625 2.250000     24
## 4      M  53       35 3.500000 3.125 2.250000     10
## 5      M  49       37 3.666667 3.625 2.833333     22
## 6      F  38       38 4.750000 3.625 2.416667     21
str(imported_file)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...

The data has 166 rows (observations) and 7 columns (variables). It comes from a survey of approaches to learning, conducted in 2014 at the University of Helsinki. A total of 183 students responded.

Analysis task 2

pairs(imported_file[-1])

summary(imported_file)
##  gender       age           attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :14.00   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:26.00   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :32.00   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :31.43   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:37.00   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :50.00   Max.   :4.917   Max.   :5.000  
##       surf           points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00
par(mfrow = c(2,3))
for (i in 2:7) {
  boxplot(imported_file[,i], main = names(imported_file)[i])
}

Concerning the distribution of the variables: age is quite concentrated between 20 and 30 but has quite a few older outliers. Attitude and stra have even distributions centered around 30 and 3 respectively, deep has some outliers with low scores, and surf is quite evenly distributed but has one outlier with a higher value. Points is a bit skewed towards higher values, but quite many respondents have lower values too. Concerning the relationships between variables: points and attitude seem to have a connection, but relationships between the other variables are difficult to see in the scatterplot matrix.

Analysis task 3 and 4

fit <- lm(points ~ attitude + age + stra, data = imported_file)
summary(fit)
## 
## Call:
## lm(formula = points ~ attitude + age + stra, data = imported_file)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.1149  -3.2003   0.3303   3.4129  10.7599 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.89543    2.64834   4.114 6.17e-05 ***
## attitude     0.34808    0.05622   6.191 4.72e-09 ***
## age         -0.08822    0.05302  -1.664   0.0981 .  
## stra         1.00371    0.53434   1.878   0.0621 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.26 on 162 degrees of freedom
## Multiple R-squared:  0.2182, Adjusted R-squared:  0.2037 
## F-statistic: 15.07 on 3 and 162 DF,  p-value: 1.07e-08
fit2 <- lm(points ~ attitude, data = imported_file)
summary(fit2)
## 
## Call:
## lm(formula = points ~ attitude, data = imported_file)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.63715    1.83035   6.358 1.95e-09 ***
## attitude     0.35255    0.05674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

Not surprisingly, given what the scatterplots looked like, my result is that only attitude has a significant relationship with points.

The p-value is approximately 4.12e-09, well below a 0.1% chance of getting this data if there were no relationship between attitude and points. A one-unit change in attitude raises points by, on average, 0.35. The adjusted R-squared tells us that attitude explains about 19% of the variation in points.
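As a small sketch of what the attitude coefficient means in practice, the intercept and slope estimates (copied from the summary above) give predicted points, and two students one attitude unit apart differ by exactly the slope:

```r
# Illustrative sketch: prediction from the fitted simple regression,
# using coefficient estimates copied from the model summary above.
intercept <- 11.63715
slope <- 0.35255
predict_points <- function(attitude) intercept + slope * attitude

# One unit more attitude -> ~0.35 more points, regardless of the baseline
predict_points(31) - predict_points(30)
```

With the fitted model object itself, the same numbers come from coef(fit2).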

Analysis task 5

par(mfrow = c(2,2))
plot(fit2, which = c(1,2,5))

A regression analysis assumes that the relationship between the explanatory and the dependent variable is linear. It also assumes that the errors are normally distributed, have constant variance, and that the size of a given error does not depend on the explanatory variables. The normal Q-Q plot suggests that the errors are roughly normally distributed: if the points follow a straight line, this assumption holds. Residuals vs. leverage shows that no single outlier affects the model too much, as no point lies outside the dashed Cook's distance lines. The residuals vs. fitted plot shows whether there is a specific pattern in the residuals, which might be the case if the relationship between the explanatory and the dependent variable is not linear or if the error variance is not constant. In our case it looks quite OK, although there are some fairly large residuals at fitted values around 24-26.


Logistic Regression Analysis

Analysis task 2

getwd()
## [1] "/home/chpatola/IODS-project"
imported_file <-read.csv(file="data/allStudents.csv")
colnames(imported_file)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"
str(imported_file)
## 'data.frame':    382 obs. of  35 variables:
##  $ school    : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex       : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
##  $ famsize   : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
##  $ Pstatus   : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
##  $ Fjob      : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
##  $ reason    : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
##  $ nursery   : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
##  $ internet  : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
##  $ guardian  : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : int  0 0 2 0 0 0 0 0 0 0 ...
##  $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
##  $ famsup    : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
##  $ paid      : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
##  $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
##  $ higher    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ romantic  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  5 3 8 1 2 8 0 4 0 0 ...
##  $ G1        : int  2 7 10 14 8 14 12 8 16 13 ...
##  $ G2        : int  8 8 10 14 12 14 12 9 17 14 ...
##  $ G3        : int  8 8 11 14 12 14 12 10 18 14 ...
##  $ alc_use   : num  1 1 2.5 1 1.5 1.5 1 1 1 1 ...
##  $ high_use  : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...

The data has 382 rows (observations) and 35 columns (variables). The file contains information about students in two Portuguese schools, including family background, grades in maths and Portuguese, alcohol use and study habits. More information is available here: [link](https://archive.ics.uci.edu/ml/datasets/Student+Performance)

Analysis task 3

The four variables that I chose, and want to study for a connection with alcohol use, are: age (numeric: from 15 to 22), address (home address type, binary: ‘U’ - urban or ‘R’ - rural), absences (number of school absences, numeric: from 0 to 93) and final grade (numeric: from 0 to 20).

Below are my hypotheses about each variable's connection with alcohol consumption. Age: the older a student, the higher the probability of drinking. Address: there could be less to do in the countryside, so students drink more there; it could also be that students in the city drink more because it is easier to get hold of alcohol there. Absences: students who drink a lot might do so due to problems at home and/or at school, which could mean they have more absences than others. Final grade: see the explanation above.

Analysis task 4

library(tidyr); library(dplyr); library(ggplot2)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Bar plot distributions for each value
gather(imported_file) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
## Warning: attributes are not identical across measure variables;
## they will be dropped

# Boxplots
g1 <- ggplot(imported_file, aes(x = high_use, y = absences, col = address))
g2 <- ggplot(imported_file, aes(x = high_use, y = age))
g3 <- ggplot(imported_file, aes(x = high_use, y = G3))

g1 + geom_boxplot() + ylab("Amount of absences") + xlab("High alcohol use")

g2 + geom_boxplot() + ylab("Student age") + xlab("High alcohol use")

g3 + geom_boxplot() + ylab("Final grade") + xlab("High alcohol use")

#Barplot address
counts <- table(imported_file$address,imported_file$high_use)
barplot(counts, main="Students' alcohol use per address type, urban - rural",
  xlab="High alcohol use", col=c("darkblue","blue"),
  legend = rownames(counts))

cor(imported_file$G3, imported_file$high_use)
## [1] -0.1284528
cor(imported_file$absences, imported_file$high_use)
## [1] 0.2233679

Looking at the graphs, my hypothesis about absences seems reasonable. Concerning age, students under 16 do not seem to drink much, and more students above 17 drink heavily. Final grade seems to have a slight negative correlation with high alcohol use. The distribution of heavy drinkers seems to be quite similar in the countryside and in the city.

Analysis task 5

m <- glm(high_use ~ address + absences + G3 + age, data = imported_file, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ address + absences + G3 + age, family = "binomial", 
##     data = imported_file)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4810  -0.8301  -0.6844   1.1992   1.9067  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.28205    1.82767  -1.249 0.211807    
## addressU    -0.36549    0.27512  -1.329 0.184012    
## absences     0.08360    0.02303   3.629 0.000284 ***
## G3          -0.06054    0.03551  -1.705 0.088195 .  
## age          0.11978    0.10149   1.180 0.237918    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 439.30  on 377  degrees of freedom
## AIC: 449.3
## 
## Number of Fisher Scoring iterations: 4
#Odd ratios
odd_ratio <- coef(m)  %>% exp()

#Confidence intervals
conf_int <- confint(m) %>% exp
## Waiting for profiling to be done...
cbind(odd_ratio,conf_int)
##             odd_ratio       2.5 %   97.5 %
## (Intercept) 0.1020743 0.002753262 3.628802
## addressU    0.6938539 0.406425388 1.198479
## absences    1.0871915 1.041208391 1.139729
## G3          0.9412543 0.877558570 1.009031
## age         1.1272469 0.924299016 1.377414

AIC (a relative measure of the information lost by the model) is 449.3.

When I look at the summary of the fitted model, I see that only absences has a significant (positive) relationship with high use of alcohol.

Confidence intervals that include 1 imply that there is no real difference in the output (alcohol use) depending on the value of the input variable. As one can see, only the confidence interval for absences excludes 1.

For a given age, G3 score and address type, the odds of a student being a heavy drinker increase by approximately 4-14% per unit increase in absences.
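The percentage interpretation above comes from exponentiating the coefficient, which lives on the log-odds scale. A minimal sketch, using the absences estimate copied from the model summary above:

```r
# Sketch: from a log-odds coefficient to a percentage change in odds.
# The estimate 0.08360 is copied from the glm summary above.
b_absences <- 0.08360
odds_ratio <- exp(b_absences)          # about 1.087
pct_change <- (odds_ratio - 1) * 100   # about 8.7% higher odds per extra absence
```

Applying the same transformation to the confidence interval endpoints gives the 4-14% range.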

Analysis task 6

model <- glm(high_use ~ absences, data = imported_file, family = "binomial")
summary(model)
## 
## Call:
## glm(formula = high_use ~ absences, family = "binomial", data = imported_file)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3754  -0.8181  -0.7282   1.2652   1.7475  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.28190    0.15920  -8.052 8.12e-16 ***
## absences     0.08982    0.02299   3.907 9.36e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 447.45  on 380  degrees of freedom
## AIC: 451.45
## 
## Number of Fisher Scoring iterations: 4
imported_file$probabilities <- predict(model, type = "response")
imported_file$prediction <- imported_file$probabilities > 0.5
head(imported_file, 2)
##   school sex age address famsize Pstatus Medu Fedu    Mjob    Fjob reason
## 1     GP   F  18       U     GT3       A    4    4 at_home teacher course
## 2     GP   F  17       U     GT3       T    1    1 at_home   other course
##   nursery internet guardian traveltime studytime failures schoolsup famsup
## 1     yes       no   mother          2         2        0       yes     no
## 2      no      yes   father          1         2        0        no    yes
##   paid activities higher romantic famrel freetime goout Dalc Walc health
## 1   no         no    yes       no      4        3     4    1    1      3
## 2   no         no    yes       no      5        3     3    1    1      3
##   absences G1 G2 G3 alc_use high_use probabilities prediction
## 1        5  2  8  8       1    FALSE     0.3030510      FALSE
## 2        3  7  8  8       1    FALSE     0.2665014      FALSE
truth <- table(Using_much_alcohol = imported_file$high_use, Predicted_high_user = imported_file$prediction)

total_wrong <- truth["FALSE", "TRUE"] + truth["TRUE", "FALSE"]

training_error <- total_wrong / nrow(imported_file)

With information on absences alone, I get a logistic regression model that predicts drinking status with a training error rate of 29%. Its AIC is 451.45, a little higher than in the model with all four variables. Compared to guessing, with a 50% chance of a correct guess, the model does a bit better. Not a great model, but better than none. However, the model is much better at identifying those who do not drink much than those who drink a lot. If we wanted to find students who drink a lot (perhaps to support them in some way), the model would hardly be of any help at all, as it identified only 9 of the 114 heavy drinkers in this data.
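The error rate and sensitivity figures quoted above can be read off a 2x2 table. A sketch with illustrative counts, reconstructed to match the numbers in the text (9 of 114 heavy users identified, roughly 29% training error); with the real data one would index the `truth` table computed above in the same way:

```r
# Sketch: classification metrics from a confusion table. The counts are
# illustrative, reconstructed from the figures quoted in the text.
cm <- matrix(c(262, 105, 6, 9), nrow = 2,
             dimnames = list(actual = c("FALSE", "TRUE"),
                             predicted = c("FALSE", "TRUE")))
err_rate <- (cm["FALSE", "TRUE"] + cm["TRUE", "FALSE"]) / sum(cm)  # ~0.29
sens <- cm["TRUE", "TRUE"] / sum(cm["TRUE", ])                     # 9/114, very low
```

The low sensitivity is exactly why the model is of little use for finding the heavy drinkers, even though its overall error rate looks tolerable.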

Clustering and Classification

Analysis task 2

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
head(Boston,2)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio black
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.9
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.9
##   lstat medv
## 1  4.98 24.0
## 2  9.14 21.6
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

The data has 506 rows (observations) and 14 columns (variables); all variables are numeric. The file contains information about housing values in suburbs of Boston. More information is available here: [link](https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html)

Analysis task 3

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
Boston_cor <-cor(Boston)
library(corrplot)
## corrplot 0.84 loaded
corrplot(Boston_cor, method="circle")

library(purrr)
library(tidyr)
library(ggplot2)

Boston %>%
  keep(is.numeric) %>% 
  gather() %>% 
  ggplot(aes(value)) +
    facet_wrap(~ key, scales = "free") +
    geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

par(mar=c(1,1,1,1))
pairs(Boston)

The strongest positive correlations can be found between tax and rad and the strongest negative between dis and nox, dis and indus, dis and age as well as medv and lstat. Looking at the description of the variables (see link above), the relationships make sense.

Concerning the distributions of values: age and black are skewed to higher values, chas, tax and rad have two groups of values, rm and medv have sort of a normal distribution and dis and crim are skewed to low values.

Analysis task 4

#Standardize boston dataset
boston_scaled <- scale(Boston)
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
#Correct class back to dataframe
class(boston_scaled)#matrix
## [1] "matrix"
class(Boston)#data frame
## [1] "data.frame"
boston_scaled_df <- as.data.frame(boston_scaled)

#Categorical variable of crime
bins <- quantile(boston_scaled_df$crim)
crime <- cut(boston_scaled_df$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
table(crime) # we now have groups with roughly the same number of observations in each
## crime
##      low  med_low med_high     high 
##      127      126      126      127
#Remove original crim from the dataset
boston_scaled_df <- dplyr::select(boston_scaled_df, -crim)

# add the new categorical value to scaled data
boston_scaled_df <- data.frame(boston_scaled_df, crime)
colnames(boston_scaled_df)
##  [1] "zn"      "indus"   "chas"    "nox"     "rm"      "age"     "dis"    
##  [8] "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"    "crime"
head(boston_scaled_df,3)
##           zn      indus       chas        nox        rm        age
## 1  0.2845483 -1.2866362 -0.2723291 -0.1440749 0.4132629 -0.1198948
## 2 -0.4872402 -0.5927944 -0.2723291 -0.7395304 0.1940824  0.3668034
## 3 -0.4872402 -0.5927944 -0.2723291 -0.7395304 1.2814456 -0.2655490
##        dis        rad        tax    ptratio     black      lstat
## 1 0.140075 -0.9818712 -0.6659492 -1.4575580 0.4406159 -1.0744990
## 2 0.556609 -0.8670245 -0.9863534 -0.3027945 0.4406159 -0.4919525
## 3 0.556609 -0.8670245 -0.9863534 -0.3027945 0.3960351 -1.2075324
##         medv crime
## 1  0.1595278   low
## 2 -0.1014239   low
## 3  1.3229375   low
#Divide dataset into train and test
n <- nrow(boston_scaled_df)

ind <- sample(n,  size = n * 0.8)

train_boston <- boston_scaled_df[ind,]
test_boston <- boston_scaled_df[-ind,]

# save the correct classes from test data
correct_classes <- test_boston$crime

# remove the crime variable from test data
test_boston <- dplyr::select(test_boston, -crime)

When we standardized the dataset, each variable's mean became 0 and its standard deviation 1. If we were to use the data for predicting a dependent variable, the effect of each explanatory variable would then be comparable, and not larger for those measured on a larger scale (for example 0-100 compared to 0-5).

Analysis task 5

lda.boston <- lda(crime ~., data = train_boston)

# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train_boston$crime)

# plot the lda results
plot(lda.boston, dimen = 2, col= classes, pch = classes)
lda.arrows(lda.boston, myscale = 4)

Analysis task 6

predictions <- predict(lda.boston, newdata = test_boston)

table(correct = correct_classes, predicted = predictions$class)
##           predicted
## correct    low med_low med_high high
##   low       10      14        1    0
##   med_low    2      25        4    0
##   med_high   2       8       11    1
##   high       0       0        1   23
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
confusionMatrix(predictions$class, correct_classes)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction low med_low med_high high
##   low       10       2        2    0
##   med_low   14      25        8    0
##   med_high   1       4       11    1
##   high       0       0        1   23
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6765          
##                  95% CI : (0.5766, 0.7658)
##     No Information Rate : 0.3039          
##     P-Value [Acc > NIR] : 1.067e-14       
##                                           
##                   Kappa : 0.5598          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: low Class: med_low Class: med_high Class: high
## Sensitivity             0.40000         0.8065          0.5000      0.9583
## Specificity             0.94805         0.6901          0.9250      0.9872
## Pos Pred Value          0.71429         0.5319          0.6471      0.9583
## Neg Pred Value          0.82955         0.8909          0.8706      0.9872
## Prevalence              0.24510         0.3039          0.2157      0.2353
## Detection Rate          0.09804         0.2451          0.1078      0.2255
## Detection Prevalence    0.13725         0.4608          0.1667      0.2353
## Balanced Accuracy       0.67403         0.7483          0.7125      0.9728

When we compare the real crime classes with the ones the LDA model predicted, we see that it did quite well. Overall accuracy is approximately 68%.

Analysis task 7

We did already standardize the Boston dataset in task 4.

#Calculated distances between suburbs

# euclidean distance matrix
dist_eu <- dist(boston_scaled)
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
#Run k-means algorithm on the dataset
km <- kmeans(boston_scaled, centers = 3)

pairs(boston_scaled, col = km$cluster)

#Investigate what is the optimal number of clusters 
set.seed(11)
k_max <- 20

twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
qplot(x = 1:k_max, y = twcss, geom = 'line')

#Run k means again with different number of clusters

km <- kmeans(boston_scaled, centers = 6)

pairs(boston_scaled, col = km$cluster)

The distances between suburbs (with respect to all standardized variables) range from a minimum of 0.1343 to a maximum of 14.3970, with a mean of 4.9111.

When we look at the graph, it seems that the more clusters, the better. However, too many clusters reduce the usefulness of dividing the data into pieces; in the extreme case we would have one cluster per data point. The decreasing slope flattens out quite a lot after approximately 6-7 clusters, so we could probably choose that as the optimal number of clusters.
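The flattening can also be expressed as a rough rule of thumb in code: stop adding clusters once the relative drop in total within-cluster sum of squares falls below some threshold, say 5%. A sketch with illustrative twcss values (not the ones computed above):

```r
# Sketch: a crude "elbow" rule. The twcss_demo values are made up for
# illustration; with real data, use the twcss vector computed above.
twcss_demo <- c(7000, 4500, 3400, 2800, 2450, 2250, 2150, 2100)
relative_drop <- -diff(twcss_demo) / head(twcss_demo, -1)
elbow <- which(relative_drop < 0.05)[1]  # first k where adding a cluster gains < 5%
```

This is only a heuristic; eyeballing the plot, as done here, is equally common.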

With 6 clusters, we have divided the housing suburbs into 6 groups, where suburbs are more similar within their own group than they are to suburbs in other groups.

Dimensionality reduction techniques

Analysis task 1

humanDev <- read.csv("data/human.csv")
head(humanDev, 2)
##           X fem_male_ed fem_male_job years_ed lifeexp gni_num mat_mort
## 1    Norway   1.0072389    0.8908297     17.5    81.6   64992        4
## 2 Australia   0.9968288    0.8189415     20.2    82.4   42261        6
##   ad_birth_rate parliament
## 1           7.8       39.6
## 2          12.1       30.5
library(GGally)
## 
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
## 
##     nasa
humanDev_numeric <- humanDev[,-1]
head(humanDev_numeric,2)
##   fem_male_ed fem_male_job years_ed lifeexp gni_num mat_mort ad_birth_rate
## 1   1.0072389    0.8908297     17.5    81.6   64992        4           7.8
## 2   0.9968288    0.8189415     20.2    82.4   42261        6          12.1
##   parliament
## 1       39.6
## 2       30.5
ggpairs(humanDev_numeric)

Looking at the distributions of the data and the relationships between the variables, many of them have no clear connection with each other, but years_ed and lifeexp seem to have a positive relationship, with a correlation of 0.79. ad_birth_rate and mat_mort also seem related, with a correlation of 0.76. lifeexp and ad_birth_rate have a negative relationship (correlation -0.73), and mat_mort and years_ed are also negatively related (correlation -0.74). ad_birth_rate is also negatively correlated with years_ed (-0.7).

Concerning the distributions, only years_ed seems to be normally distributed. fem_male_ed, fem_male_job and parliament also come close to a normal distribution, but the other variables are skewed.

Analysis task 2

pca_human <- prcomp(humanDev_numeric)

summary(pca_human) #Almost all variation can be explained by the first PC
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000
##                           PC8
## Standard deviation     0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion  1.0000
biplot(pca_human, choices = 1:2, col=c("black","blue"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

Analysis task 3

#Standardize dataset
humanDev_numeric_stand <- scale(humanDev_numeric)
summary(humanDev_numeric_stand)
##   fem_male_ed       fem_male_job        years_ed          lifeexp       
##  Min.   :-2.8189   Min.   :-2.6247   Min.   :-2.7378   Min.   :-2.7188  
##  1st Qu.:-0.5233   1st Qu.:-0.5484   1st Qu.:-0.6782   1st Qu.:-0.6425  
##  Median : 0.3503   Median : 0.2316   Median : 0.1140   Median : 0.3056  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5958   3rd Qu.: 0.7350   3rd Qu.: 0.7126   3rd Qu.: 0.6717  
##  Max.   : 2.6646   Max.   : 1.6632   Max.   : 2.4730   Max.   : 1.4218  
##     gni_num           mat_mort       ad_birth_rate       parliament     
##  Min.   :-0.9193   Min.   :-0.6992   Min.   :-1.1325   Min.   :-1.8203  
##  1st Qu.:-0.7243   1st Qu.:-0.6496   1st Qu.:-0.8394   1st Qu.:-0.7409  
##  Median :-0.3013   Median :-0.4726   Median :-0.3298   Median :-0.1403  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3712   3rd Qu.: 0.1932   3rd Qu.: 0.6030   3rd Qu.: 0.6127  
##  Max.   : 5.6890   Max.   : 4.4899   Max.   : 3.8344   Max.   : 3.1850
pca_human_stand <- prcomp(humanDev_numeric_stand)

summary(pca_human_stand) #The variation is explained much more evenly across the components than with the unstandardized dataset
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
##                            PC7     PC8
## Standard deviation     0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion  0.98702 1.00000
biplot(pca_human_stand, choices = 1:2, col=c("black","blue"))

When we standardized the dataset, each variable got mean 0 and standard deviation 1. In the unstandardized dataset, variables with larger variance due to a broader range of values (like gni_num) are given much more weight in the PCA. When we standardize the variables, differences in scale no longer matter and we get a more balanced picture of how the different variables contribute to the variation in the data. We clearly see the difference between the standardized and unstandardized data in our biplots and the model summaries. In the unstandardized version, gni_num dominates the variation, and since it loads almost entirely on PC1, PC1 alone accounts for more than 99% of the variance. In the standardized version, we need all 8 PCs to reach the same cumulative percentage.
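The effect of scale on PCA can be illustrated with a small toy example (a sketch, independent of the course data): a variable measured in large units dominates the first component until the data are standardized:

```r
set.seed(42)
toy <- data.frame(
  small = rnorm(100, sd = 1),      # variable on a small scale
  large = rnorm(100, sd = 1000)    # same shape, but in much larger units
)

# Unstandardized: PC1 is essentially just the large-scale variable
p_raw <- prcomp(toy)
summary(p_raw)$importance["Proportion of Variance", "PC1"]  # close to 1

# Standardized: the variance is shared evenly between the components
p_std <- prcomp(toy, scale. = TRUE)
summary(p_std)$importance["Proportion of Variance", "PC1"]  # close to 0.5
```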

Analysis task 4

My interpretation of the biplot of components 1 and 2 from the standardized data is: PC1 and PC2 together explain approximately 70% of the variation. When deciding how many components to keep in a model, there are different rules of thumb. One is to keep components with a standard deviation of at least 0.7 (here that would mean PC1-PC4), but one can also use graphical methods such as a scree plot. If we stick to only PC1 and PC2 and focus on their biplot, we can say that fem_male_job and parliament are correlated and mainly contribute to PC2, while mat_mort and ad_birth_rate are also correlated and mainly contribute to PC1. In general, a small angle between feature arrows indicates that the features are correlated, and the component an arrow is parallel to is the one that feature contributes to. The length of an arrow corresponds to the feature's standard deviation; in our standardized data there are no big differences between the standard deviations of the features.
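This geometric interpretation can be verified numerically: for standardized data, the loadings scaled by the component standard deviations reproduce the correlation matrix exactly when all components are used (the biplot, which keeps only two, is therefore an approximation). A sketch on toy data:

```r
set.seed(1)
X <- scale(matrix(rnorm(300), ncol = 3))  # 100 rows, 3 standardized columns
p <- prcomp(X)

# Scaled loadings: rotation matrix columns times the component sds
L <- p$rotation %*% diag(p$sdev)

# L %*% t(L) rebuilds the full covariance (= correlation) matrix of X
max(abs(L %*% t(L) - cor(X)))  # essentially zero
```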

Analysis task 5

library(FactoMineR)
library(dplyr)
library(tidyr)
data(tea)

str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300  36
chosen_tea <- tea[,c(3,7,14,15)]
str(chosen_tea)
## 'data.frame':    300 obs. of  4 variables:
##  $ evening: Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ home   : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ How    : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar  : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
mca_model <- MCA(chosen_tea)

summary(mca_model)
## 
## Call:
## MCA(X = chosen_tea) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.298   0.275   0.254   0.245   0.238   0.190
## % of var.             19.860  18.304  16.933  16.364  15.878  12.661
## Cumulative % of var.  19.860  38.164  55.097  71.461  87.339 100.000
## 
## Individuals (the 10 first)
##                Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
## 1           |  0.112  0.014  0.023 | -0.075  0.007  0.010 |  0.142  0.026
## 2           | -0.601  0.404  0.275 |  0.589  0.421  0.264 | -0.142  0.026
## 3           | -0.104  0.012  0.013 | -0.457  0.254  0.245 | -0.535  0.375
## 4           |  0.112  0.014  0.023 | -0.075  0.007  0.010 |  0.142  0.026
## 5           | -0.104  0.012  0.013 | -0.457  0.254  0.245 | -0.535  0.375
## 6           | -0.557  0.347  0.613 | -0.351  0.149  0.243 |  0.027  0.001
## 7           | -0.557  0.347  0.613 | -0.351  0.149  0.243 |  0.027  0.001
## 8           | -0.148  0.024  0.013 |  0.482  0.282  0.140 | -0.703  0.649
## 9           | -0.601  0.404  0.275 |  0.589  0.421  0.264 | -0.142  0.026
## 10          | -0.104  0.012  0.013 | -0.457  0.254  0.245 | -0.535  0.375
##               cos2  
## 1            0.037 |
## 2            0.015 |
## 3            0.335 |
## 4            0.037 |
## 5            0.335 |
## 6            0.001 |
## 7            0.001 |
## 8            0.298 |
## 9            0.015 |
## 10           0.335 |
## 
## Categories
##                 Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## evening     |   0.650  12.161   0.221   8.123 |  -0.146   0.671   0.011
## Not.evening |  -0.340   6.358   0.221  -8.123 |   0.077   0.351   0.011
## home        |  -0.077   0.486   0.193  -7.599 |   0.090   0.721   0.264
## Not.home    |   2.499  15.720   0.193   7.599 |  -2.922  23.324   0.264
## alone       |  -0.094   0.478   0.016  -2.205 |  -0.623  22.949   0.720
## lemon       |   1.276  15.020   0.201   7.754 |   0.846   7.160   0.088
## milk        |  -0.189   0.631   0.010  -1.687 |   1.347  34.669   0.482
## other       |  -1.325   4.421   0.054  -4.030 |   0.966   2.549   0.029
## No.sugar    |  -0.706  21.618   0.533 -12.624 |  -0.280   3.676   0.084
## sugar       |   0.755  23.108   0.533  12.624 |   0.299   3.929   0.084
##              v.test     Dim.3     ctr    cos2  v.test  
## evening      -1.832 |  -0.743  18.650   0.289  -9.288 |
## Not.evening   1.832 |   0.388   9.751   0.289   9.288 |
## home          8.886 |  -0.068   0.444   0.150  -6.707 |
## Not.home     -8.886 |   2.205  14.362   0.150   6.707 |
## alone       -14.674 |  -0.154   1.523   0.044  -3.635 |
## lemon         5.140 |   0.833   7.515   0.086   5.065 |
## milk         12.004 |  -0.494   5.038   0.065  -4.401 |
## other         2.938 |   3.744  41.388   0.433  11.385 |
## No.sugar     -4.997 |  -0.112   0.642   0.013  -2.009 |
## sugar         4.997 |   0.120   0.686   0.013   2.009 |
## 
## Categorical variables (eta2)
##               Dim.1 Dim.2 Dim.3  
## evening     | 0.221 0.011 0.289 |
## home        | 0.193 0.264 0.150 |
## How         | 0.245 0.739 0.563 |
## sugar       | 0.533 0.084 0.013 |
plot(mca_model, invisible=c("ind"))

The tea dataset contains 300 rows and 36 columns. In the MCA plot, the category Not.home stands apart from the rest of the categories, probably because if you are not at home (for example at work), you are less likely to sit down with a good cup of tea. The rest of the categories seem to be related. Interestingly, lemon and No.sugar are far from each other: if you have lemon in your tea, you might be more likely to also have sugar in it.
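Under the hood, MCA works on the complete disjunctive (indicator) table of the factors. As a sketch of what that table looks like, built with base R on a toy data frame (MCA() handles this internally; the toy columns only mimic variables like sugar and How):

```r
toy <- data.frame(
  sugar = factor(c("sugar", "No.sugar", "sugar")),
  How   = factor(c("alone", "lemon", "milk"))
)

# One 0/1 column per category level; "~ f - 1" drops the intercept so
# every level of each factor gets its own indicator column
disjunctive <- do.call(cbind, lapply(toy, function(f) model.matrix(~ f - 1)))

# Each row sums to the number of variables: exactly one level per factor
rowSums(disjunctive)
```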


Longitudinal data

Analysis task 1

#Importing data and inspecting it
library(dplyr)
library(tidyr)

#rats
rats <- read.csv("data/rats.csv")
summary(rats)
##        ID            Group           time         gram      
##  Min.   : 1.00   Min.   :1.00   WD1    :16   Min.   :225.0  
##  1st Qu.: 4.75   1st Qu.:1.00   WD15   :16   1st Qu.:267.0  
##  Median : 8.50   Median :1.50   WD22   :16   Median :344.5  
##  Mean   : 8.50   Mean   :1.75   WD29   :16   Mean   :384.5  
##  3rd Qu.:12.25   3rd Qu.:2.25   WD36   :16   3rd Qu.:511.2  
##  Max.   :16.00   Max.   :3.00   WD43   :16   Max.   :628.0  
##                                 (Other):80
str(rats)
## 'data.frame':    176 obs. of  4 variables:
##  $ ID   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Group: int  1 1 1 1 1 1 1 1 2 2 ...
##  $ time : Factor w/ 11 levels "WD1","WD15","WD22",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ gram : int  240 225 245 260 255 260 275 245 410 405 ...
rats$ID <- as.factor(rats$ID)
rats$Group <- as.factor(rats$Group)
str(rats)
## 'data.frame':    176 obs. of  4 variables:
##  $ ID   : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Group: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
##  $ time : Factor w/ 11 levels "WD1","WD15","WD22",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ gram : int  240 225 245 260 255 260 275 245 410 405 ...
#Convert time to numeric

rats$time <- as.character(rats$time)
rats$time <- as.numeric(gsub("WD","",rats$time))
str(rats)
## 'data.frame':    176 obs. of  4 variables:
##  $ ID   : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Group: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
##  $ time : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ gram : int  240 225 245 260 255 260 275 245 410 405 ...
summary(rats)
##        ID      Group       time            gram      
##  1      : 11   1:88   Min.   : 1.00   Min.   :225.0  
##  2      : 11   2:44   1st Qu.:15.00   1st Qu.:267.0  
##  3      : 11   3:44   Median :36.00   Median :344.5  
##  4      : 11          Mean   :33.55   Mean   :384.5  
##  5      : 11          3rd Qu.:50.00   3rd Qu.:511.2  
##  6      : 11          Max.   :64.00   Max.   :628.0  
##  (Other):110
#Plot the weight development of the individual rats
library(ggplot2)

ggplot(rats, aes(x = time, y = gram, linetype = ID)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(limits = c(min(rats$gram), max(rats$gram)))

#Standardize the weights in order to see the weight development even more clearly

rats$gram_sd <- scale(rats$gram)
summary(rats)
##        ID      Group       time            gram            gram_sd.V1     
##  1      : 11   1:88   Min.   : 1.00   Min.   :225.0   Min.   :-1.2541867  
##  2      : 11   2:44   1st Qu.:15.00   1st Qu.:267.0   1st Qu.:-0.9238953  
##  3      : 11   3:44   Median :36.00   Median :344.5   Median :-0.3144291  
##  4      : 11          Mean   :33.55   Mean   :384.5   Mean   : 0.0000000  
##  5      : 11          3rd Qu.:50.00   3rd Qu.:511.2   3rd Qu.: 0.9969062  
##  6      : 11          Max.   :64.00   Max.   :628.0   Max.   : 1.9150375  
##  (Other):110
#Plot the standardized curves

ggplot(rats, aes(x = time, y = gram_sd, linetype = ID)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(limits = c(min(rats$gram_sd), max(rats$gram_sd)))

#Look into a summary variable, the mean weight, for the different groups

#Create a new table with the mean weight per individual

rats_summative<- rats %>%
  filter(time > 0) %>%
  group_by(Group, ID) %>%
  summarise( mean_gram=mean(gram) ) %>%
  ungroup()

str(rats_summative)
## Classes 'tbl_df', 'tbl' and 'data.frame':    16 obs. of  3 variables:
##  $ Group    : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
##  $ ID       : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ mean_gram: num  261 238 260 267 269 ...
#Visualize the mean weights per group

ggplot(rats_summative, aes(x = Group, y = mean_gram)) +
  geom_boxplot() +
  stat_summary(fun.y = "mean", geom = "point", shape=23, size=2, fill = "black") +
  scale_y_continuous(name = "mean(gram)")

#The outlier we earlier saw in group 2 is very visible in this boxplot. It could be wise to remove it,
#so we filter it out and redraw the plot

rats_summative_no_outlier <- rats_summative %>% filter(mean_gram <570)

ggplot(rats_summative_no_outlier, aes(x = Group, y = mean_gram)) +
  geom_boxplot() +
  stat_summary(fun.y = "mean", geom = "point", shape=23, size=2, fill = "black") +
  scale_y_continuous(name = "mean(gram)")

#Check with an ANOVA test that the group means really are different -> they clearly are!

oneway.test(mean_gram~Group,data=rats_summative_no_outlier,var.equal=TRUE) 
## 
##  One-way analysis of means
## 
## data:  mean_gram and Group
## F = 483.6, num df = 2, denom df = 12, p-value = 3.387e-12

When we plot the weight development of the rats on an individual level, we see that the general tendency is for the weight to increase. Group 1 differs from groups 2 and 3 in that the individuals belonging to it are much lighter than those in the other two groups, both at the start and at the end of the time period. Group 2 also has one individual whose weight, throughout all the weeks, is considerably higher than that of the rest of the rats.

If we standardize the weights and redraw the plot, we see the same patterns even more clearly.

In order to further familiarize ourselves with the data, we can look at it on a summary level. In the boxplot, we look at the means - and general distributions - of the different groups. Are they different? The boxplot clearly suggests so, and also that the variation within groups is quite small, but we can verify this with an ANOVA test. The p-value of the test clearly shows that the means of the 3 groups are highly unlikely to be the same.
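As a side note, oneway.test() with var.equal = TRUE is the classical one-way ANOVA, so it agrees exactly with the F test from a linear model. A small sketch on toy data (independent of the rats data):

```r
set.seed(7)
toy <- data.frame(
  g = factor(rep(1:3, each = 5)),
  y = c(rnorm(5, mean = 10), rnorm(5, mean = 12), rnorm(5, mean = 15))
)

# Same F test, computed two ways: the p-values are identical
p1 <- oneway.test(y ~ g, data = toy, var.equal = TRUE)$p.value
p2 <- anova(lm(y ~ g, data = toy))[["Pr(>F)"]][1]
all.equal(p1, p2)
```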

Analysis task 2

#Importing data and inspecting it
#bprs
bprs <- read.csv("data/bprs.csv")
bprs$treatment <- as.factor(bprs$treatment)
bprs$subject <- as.factor(bprs$subject)
str(bprs)
## 'data.frame':    360 obs. of  4 variables:
##  $ treatment : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ subject   : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ weeks     : Factor w/ 9 levels "week0","week1",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ bprs_score: int  42 58 54 55 72 48 71 30 41 57 ...
#Extract week number from week strings
bprs$weeks <- as.character(bprs$weeks)
str(bprs)
## 'data.frame':    360 obs. of  4 variables:
##  $ treatment : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ subject   : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ weeks     : chr  "week0" "week0" "week0" "week0" ...
##  $ bprs_score: int  42 58 54 55 72 48 71 30 41 57 ...
bprs$week_int <- as.numeric(gsub("week","",bprs$weeks))
summary(bprs) #week from 0  (baseline) to 8
##  treatment    subject       weeks             bprs_score       week_int
##  1:180     1      : 18   Length:360         Min.   :18.00   Min.   :0  
##  2:180     2      : 18   Class :character   1st Qu.:27.00   1st Qu.:2  
##            3      : 18   Mode  :character   Median :35.00   Median :4  
##            4      : 18                      Mean   :37.66   Mean   :4  
##            5      : 18                      3rd Qu.:43.00   3rd Qu.:6  
##            6      : 18                      Max.   :95.00   Max.   :8  
##            (Other):252
str(bprs)
## 'data.frame':    360 obs. of  5 variables:
##  $ treatment : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ subject   : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ weeks     : chr  "week0" "week0" "week0" "week0" ...
##  $ bprs_score: int  42 58 54 55 72 48 71 30 41 57 ...
##  $ week_int  : num  0 0 0 0 0 0 0 0 0 0 ...
head(bprs,3)
##   treatment subject weeks bprs_score week_int
## 1         1       1 week0         42        0
## 2         1       2 week0         58        0
## 3         1       3 week0         54        0
#We start our analysis by pretending the observations are independent (i.e., we ignore the fact that they were made on the same subjects at different time points) and use a normal linear regression

dim(bprs)#360 observations on 5 variables
## [1] 360   5
bprs_reg <- lm(bprs_score ~ week_int + treatment, data = bprs)
summary(bprs_reg)
## 
## Call:
## lm(formula = bprs_score ~ week_int + treatment, data = bprs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.454  -8.965  -3.196   7.002  50.244 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.4539     1.3670  33.982   <2e-16 ***
## week_int     -2.2704     0.2524  -8.995   <2e-16 ***
## treatment2    0.5722     1.3034   0.439    0.661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared:  0.1851, Adjusted R-squared:  0.1806 
## F-statistic: 40.55 on 2 and 357 DF,  p-value: < 2.2e-16
#Plot to see signs that bprs_score depends on treatment -> could be, but difficult to say
library(ggplot2)
library(GGally)
ggpairs(bprs[,c(1,4:5)])
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#Random intercept model
library(lme4)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following object is masked from 'package:tidyr':
## 
##     expand
bprs_rim <- lmer(bprs_score ~ week_int + treatment + (1 | subject), data = bprs, REML = FALSE)
summary(bprs_rim)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs_score ~ week_int + treatment + (1 | subject)
##    Data: bprs
## 
##      AIC      BIC   logLik deviance df.resid 
##   2748.7   2768.1  -1369.4   2738.7      355 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0481 -0.6749 -0.1361  0.4813  3.4855 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept)  47.41    6.885  
##  Residual             104.21   10.208  
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     1.9090  24.334
## week_int     -2.2704     0.2084 -10.896
## treatment2    0.5722     1.0761   0.532
## 
## Correlation of Fixed Effects:
##            (Intr) wek_nt
## week_int   -0.437       
## treatment2 -0.282  0.000
#Random Intercept and Random Slope Model
bprs_rirsm <-lmer(bprs_score ~ week_int + treatment  + (week_int | subject), data = bprs, REML = FALSE)
summary(bprs_rirsm)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs_score ~ week_int + treatment + (week_int | subject)
##    Data: bprs
## 
##      AIC      BIC   logLik deviance df.resid 
##   2745.4   2772.6  -1365.7   2731.4      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8919 -0.6194 -0.0691  0.5531  3.7977 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.8222  8.0512        
##           week_int     0.9609  0.9803   -0.51
##  Residual             97.4304  9.8707        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     2.1052  22.066
## week_int     -2.2704     0.2977  -7.626
## treatment2    0.5722     1.0405   0.550
## 
## Correlation of Fixed Effects:
##            (Intr) wek_nt
## week_int   -0.582       
## treatment2 -0.247  0.000
#ANOVA analysis to see the difference between the two random intercept models
anova(bprs_rirsm, bprs_rim )
## Data: bprs
## Models:
## bprs_rim: bprs_score ~ week_int + treatment + (1 | subject)
## bprs_rirsm: bprs_score ~ week_int + treatment + (week_int | subject)
##            Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)  
## bprs_rim    5 2748.7 2768.1 -1369.4   2738.7                           
## bprs_rirsm  7 2745.4 2772.6 -1365.7   2731.4 7.2721      2    0.02636 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Random Intercept and Random Slope Model with interaction
bprs_rirsmwi <- lmer(bprs_score ~ week_int * treatment  + (week_int | subject), data = bprs, REML = FALSE)
summary(bprs_rirsmwi)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs_score ~ week_int * treatment + (week_int | subject)
##    Data: bprs
## 
##      AIC      BIC   logLik deviance df.resid 
##   2744.3   2775.4  -1364.1   2728.3      352 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0512 -0.6271 -0.0768  0.5288  3.9260 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.9964  8.0620        
##           week_int     0.9687  0.9842   -0.51
##  Residual             96.4707  9.8220        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##                     Estimate Std. Error t value
## (Intercept)          47.8856     2.2521  21.262
## week_int             -2.6283     0.3589  -7.323
## treatment2           -2.2911     1.9090  -1.200
## week_int:treatment2   0.7158     0.4010   1.785
## 
## Correlation of Fixed Effects:
##             (Intr) wek_nt trtmn2
## week_int    -0.650              
## treatment2  -0.424  0.469       
## wk_nt:trtm2  0.356 -0.559 -0.840
#ANOVA analysis to see the difference between the last two models
anova(bprs_rirsmwi,bprs_rirsm)
## Data: bprs
## Models:
## bprs_rirsm: bprs_score ~ week_int + treatment + (week_int | subject)
## bprs_rirsmwi: bprs_score ~ week_int * treatment + (week_int | subject)
##              Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)  
## bprs_rirsm    7 2745.4 2772.6 -1365.7   2731.4                           
## bprs_rirsmwi  8 2744.3 2775.4 -1364.1   2728.3 3.1712      1    0.07495 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

When we start analysing our data by first fitting a normal linear regression model (see comments embedded in the code above), we can see that the time factor is significant, with an average decrease of 2.3 points per week. The treatment factor is not significant in this model.
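As a quick sanity check on the coefficients (a sketch using the estimates printed above), the fitted mean score for treatment group 1 at a given week follows directly from the regression equation:

```r
# Coefficients from summary(bprs_reg) above
b0 <- 46.4539; b_week <- -2.2704; b_treat2 <- 0.5722

# Predicted mean bprs_score for treatment 1 (the baseline group)
b0 + b_week * 0               # 46.45 at baseline (week 0)
b0 + b_week * 8               # about 28.29 after eight weeks
b0 + b_week * 8 + b_treat2    # treatment 2 shifts this by only 0.57
```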

Next, we use the random intercept model. It introduces a constant random "error term" connected to each subject. This may arise because, for example, a subject consistently exaggerates the effects measured - most commonly by reporting higher or lower values on a questionnaire than another subject with the same symptoms would have done. In the summary of the model, we see that the random intercept for subject has a standard deviation of 6.9, which is quite high. Looking at the fixed effects, "week" is still significant, with an average decrease of 2.3 points per week. The treatment term is still insignificant.
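What the random intercept captures can be illustrated with a small simulation (a sketch, unrelated to the actual BPRS data): each subject gets a constant personal offset on top of the common time trend:

```r
set.seed(123)
n_subj <- 20
weeks  <- 0:8

# Subject-specific constant offsets; sd = 7 mirrors the estimated
# random-intercept sd of about 6.9 in the model summary above
subject_offset <- rnorm(n_subj, mean = 0, sd = 7)

sim <- expand.grid(subject = 1:n_subj, week = weeks)
sim$score <- 46 - 2.3 * sim$week +        # common fixed trend
  subject_offset[sim$subject] +           # per-subject shift
  rnorm(nrow(sim), sd = 10)               # residual noise

# Subjects now differ in overall level but share the same average slope
range(tapply(sim$score, sim$subject, mean))
```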

The random intercept and random slope model tested next allows each subject's slope - and not only its intercept - to differ. This means that not only is a random offset added to or subtracted from each individual's score, but also the effect of a change in time is allowed to vary between individuals. When we look at the summary of this model, we see that the standard deviation of the random subject intercept is even higher than in the last model (8.1), and that the standard deviation of the random week slope is about 1.0. Looking at the fixed effects, a one-week increase still decreases the score by 2.3 points on average. The significance of week and treatment (yes and no, respectively) is the same as in the two earlier models.

An ANOVA comparison of the two random intercept models shows that the second model is preferable to the first (p = 0.026).

Our final model is the random intercept and random slope model with interaction. It allows for an interaction between treatment and time, not only between the individual and these. In the output from this model, we see random effects similar to the last model: the standard deviation of the subject intercept is still 8.1 and that of the week slope about 1.0. In the fixed effects, "week" is still significant, with an average decrease of 2.6 points per week (for treatment group 1), and the treatment term is still insignificant.

The ANOVA comparison of the last two models shows that the third model fits slightly better than the second, but only at the 10% significance level (p = 0.075).
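The p-value in the last anova() comparison is simply the upper tail of a chi-square distribution applied to the difference in deviances. A quick check using the (rounded) deviances reported above:

```r
# Deviances from the two model summaries above
dev_rirsm   <- 2731.4   # random intercept + slope
dev_rirsmwi <- 2728.3   # ... with the week:treatment interaction

chisq <- dev_rirsm - dev_rirsmwi           # 3.1, for one extra parameter
p <- pchisq(chisq, df = 1, lower.tail = FALSE)
p  # about 0.078, close to the 0.075 from anova() (which uses unrounded values)
```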